This paper studies statistical aggregation procedures in the regression
setting. A motivating factor is the existence of many different
methods of estimation, leading to possibly competing estimators.
We consider here three different types of aggregation: model selection 
(MS) aggregation, convex (C) aggregation and linear (L) aggregation. 
The objective of (MS) is to select the optimal single estimator from 
the list; that of (C) is to select the optimal convex combination of the 
given estimators; and that of (L) is to select the optimal linear com- 
bination of the given estimators. We are interested in evaluating the 
rates of convergence of the excess risks of the estimators obtained by 
these procedures. Our approach is motivated by recently published 
minimax results [Nemirovski, A. (2000). Topics in non-parametric 
statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 
1998). Lecture Notes in Math. 1738 85-277. Springer, Berlin; Tsy- 
bakov, A. B. (2003). Optimal rates of aggregation. Learning Theory 
and Kernel Machines. Lecture Notes in Artificial Intelligence 2777 
303-313. Springer, Heidelberg]. There exist competing aggregation 
procedures achieving optimal convergence rates for each of the (MS), 
(C) and (L) cases separately. Since these procedures are not directly 
comparable with each other, we suggest an alternative solution. We 
prove that all three optimal rates, as well as those for the newly in- 
troduced (S) aggregation (subset selection), are nearly achieved via 
a single "universal" aggregation procedure. The procedure consists 
of mixing the initial estimators with weights obtained by penalized 
least squares. Two different penalties are considered: one of them is 
of the BIC type, the second one is a data-dependent $\ell_1$-type penalty.

1. Introduction. In this paper we study aggregation procedures and their
performance for regression models. Let $\mathcal{D}_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ be a
sample of independent random pairs $(X_i,Y_i)$ with

(1.1) $Y_i = f(X_i) + W_i, \quad i = 1,\dots,n,$

where $f:\mathcal{X}\to\mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{X}$ is a
Borel subset of $\mathbb{R}^d$, the $X_i$'s are fixed elements in $\mathcal{X}$ and the errors $W_i$ are
zero mean random variables.

Aggregation of arbitrary estimators in regression models has recently
received increasing attention [9, 15, 23, 26, 34, 40, 42, 43, 44, 45]. A motivating
factor is the existence of many different methods of estimation, leading to
possibly competing estimators $\hat f_1,\dots,\hat f_M$. A natural idea is then to look for
a new, improved estimator $\tilde f$ constructed by combining $\hat f_1,\dots,\hat f_M$ in a
suitable way. Such an estimator $\tilde f$ is called an aggregate and its construction is
called aggregation.

Three main aggregation problems are model selection (MS) aggregation, 
convex (C) aggregation and linear (L) aggregation, as first stated by Ne- 
mirovski [34]. The objective of (MS) is to select the optimal (in a sense to 
be defined) single estimator from the list; that of (C) is to select the optimal 
convex combination of the given estimators; and that of (L) is to select the 
optimal linear combination of the given estimators. 

Aggregation procedures are typically based on sample splitting. The initial
sample $\mathcal{D}_n$ is divided into a training sample, used to construct estimators
$\hat f_1,\dots,\hat f_M$, and an independent validation sample, used to learn, that is, to
construct $\tilde f$. In this paper we do not consider sample splitting schemes but
rather deal with an idealized scheme. We fix the training sample, and thus
instead of estimators $\hat f_1,\dots,\hat f_M$, we have fixed functions $f_1,\dots,f_M$. A
passage to the initial model in our results is straightforward: conditioning on
the training sample, we write the inequalities of Theorems 3.1 and 4.1 or,
for example, (1.2) below. Then, we take expectations on both sides of these
inequalities over the distribution of the whole sample $\mathcal{D}_n$ and interchange
the expectation and infimum signs to get bounds containing the risks of
the estimators on the right-hand side. The fixed functions $f_1,\dots,f_M$ can
be considered as elements of an (overdetermined) dictionary, see [19], or as
"weak learners," see [37], and our results can be interpreted in such contexts
as well.

To give precise definitions, denote by $\|g\|_n = \{n^{-1}\sum_{i=1}^n g^2(X_i)\}^{1/2}$ the
empirical norm of a function $g$ in $\mathbb{R}^d$ and set $\mathsf{f}_\lambda = \sum_{j=1}^M \lambda_j f_j$ for any $\lambda =
(\lambda_1,\dots,\lambda_M)\in\mathbb{R}^M$. The performance of an aggregate $\tilde f$ can be judged against
the mathematical target

(1.2) $E_f\|\tilde f - f\|_n^2 \le \inf_{\lambda\in H^M}\|\mathsf{f}_\lambda - f\|_n^2 + \Delta_{n,M},$

where $\Delta_{n,M}\ge 0$ is a remainder term independent of $f$ characterizing the
price to pay for aggregation, and the set $H^M$ is either the entire $\mathbb{R}^M$ (for
linear aggregation), the simplex $\Lambda^M = \{\lambda = (\lambda_1,\dots,\lambda_M)\in\mathbb{R}^M : \lambda_j\ge 0,
\sum_{j=1}^M\lambda_j\le 1\}$ (for convex aggregation), or the set of all vertices of $\Lambda^M$,
except the vertex $(0,\dots,0)\in\mathbb{R}^M$ (for model selection aggregation). Here
and later $E_f$ denotes the expectation with respect to the joint distribution
of $(X_1,Y_1),\dots,(X_n,Y_n)$ under model (1.1). The random functions $\mathsf{f}_\lambda$ attaining
$\inf_{\lambda\in H^M}\|\mathsf{f}_\lambda - f\|_n$ in (1.2) for the three values taken by $H^M$ are called
(L), (C) and (MS) oracles, respectively. Note that these minimizers are not
estimators since they depend on the true $f$.

We also introduce a fourth type of aggregation, subset selection, or (S)
aggregation. For (S) aggregation we fix an integer $D\le M$ and put $H^M = \Lambda^{M,D}$,
where $\Lambda^{M,D}$ denotes the set of all $\lambda\in\mathbb{R}^M$ having at most $D$ nonzero
coordinates. Note that (L) aggregation is a special case of subset selection [(S)
aggregation] for $D = M$. The literature on subset selection techniques is very
large and dates back to [1, 33, 38]. We refer to the recent comprehensive sur- 
vey [36] for references on methods geared mainly to parametric models. For 
a review of techniques leading to subset selection in nonparametric settings 
we refer to [7] and the references therein. 

We say that the aggregate $\tilde f$ mimics the (L), (C), (MS) or (S) oracle if it
satisfies (1.2) for the corresponding set $H^M$, with the minimal possible price
for aggregation $\Delta_{n,M}$. Minimal possible values $\Delta_{n,M}$ for the three problems
can be defined via a minimax setting and they are called optimal rates of
aggregation [40] and further denoted by $\psi_{n,M}$. Extending the results of [40]
obtained in the random design case to the fixed design case, we will show in
Sections 3 and 5 that under mild conditions

(1.3) $\psi_{n,M} \asymp \begin{cases} M/n, & \text{for (L) aggregation},\\ \{D\log(1+M/D)\}/n, & \text{for (S) aggregation},\\ M/n, & \text{for (C) aggregation, if } M\le\sqrt{n},\\ \sqrt{\{\log(1+M/\sqrt{n})\}/n}, & \text{for (C) aggregation, if } M>\sqrt{n},\\ (\log M)/n, & \text{for (MS) aggregation}. \end{cases}$

This implies that linear aggregation has the highest price, (MS) aggregation 
has the lowest price and convex aggregation occupies an intermediate place. 
The oracle risks on the right-hand side in (1.2) satisfy a reversed inequality, 

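$\inf_{1\le j\le M}\|f_j - f\|_n^2 \ge \inf_{\lambda\in\Lambda^M}\|\mathsf{f}_\lambda - f\|_n^2 \ge \inf_{\lambda\in\mathbb{R}^M}\|\mathsf{f}_\lambda - f\|_n^2,$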

since the sets over which the infima are taken are nested. There is no winner 
among the three aggregation techniques and the question of how to choose 
the best among them remains open. 
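
To make the ordering of the prices in (1.3) concrete, the following minimal sketch evaluates the four rates for given $n$, $M$ and $D$. The function name `optimal_rate` and its string labels are illustrative assumptions, and all constants are ignored since (1.3) holds only up to constant factors.

```python
import math

def optimal_rate(kind, n, M, D=None):
    """Rate psi_{n,M} from (1.3), up to constant factors (illustrative helper)."""
    if kind == "L":                      # linear aggregation
        return M / n
    if kind == "S":                      # subset selection with at most D nonzeros
        if D is None:
            raise ValueError("D is required for (S) aggregation")
        return D * math.log(1 + M / D) / n
    if kind == "C":                      # convex aggregation, two regimes
        if M <= math.sqrt(n):
            return M / n
        return math.sqrt(math.log(1 + M / math.sqrt(n)) / n)
    if kind == "MS":                     # model selection aggregation
        return math.log(M) / n
    raise ValueError("kind must be one of 'L', 'S', 'C', 'MS'")

if __name__ == "__main__":
    n, M = 10_000, 1_000
    for kind in ("MS", "C", "L"):
        print(kind, optimal_rate(kind, n, M))
    print("S (D=10)", optimal_rate("S", n, M, D=10))
```

For $n = 10^4$ and $M = 10^3$ the (MS) rate is the smallest, the (L) rate the largest and the (C) rate lies in between, in line with the discussion above.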

The ideal oracle inequality (1.2) is available only for some special cases.
See [13, 15, 27] for (MS) aggregation, [25, 26, 34, 40] for (C) aggregation with
$M>\sqrt{n}$ and [40] for (L) aggregation and for (C) aggregation with $M\le\sqrt{n}$.
For more general situations there exist less sharp results of the type

(1.4) $E_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in H^M}\|\mathsf{f}_\lambda - f\|_n^2 + \Delta_{n,M},$

where $\varepsilon>0$ is a constant independent of $f$ and $n$, and $\Delta_{n,M}$ is a remainder
term, not necessarily having the same behavior in $n$ and $M$ as the optimal
one $\psi_{n,M}$.

Bounds of the type (1.4) in regression problems have been obtained by 
many authors mainly for the model selection case; see, for example, [4, 5, 7, 8, 
9, 10, 11, 12, 15, 23, 28, 30, 32, 42] and the references cited in these works. 
Most of the papers on model selection treat particular restricted families 
of estimators, such as orthogonal series estimators, spline estimators, and 
so forth. There are relatively few results on (MS) aggregation when the 
estimators are allowed to be arbitrary; see [9, 13, 15, 23, 27, 40, 42, 43, 44, 45]. 
Various convex aggregation procedures for nonparametric regression have 
emerged in the last decade. The literature on oracle inequalities of the type 
(1.2) and (1.4) for the (C) aggregation case is not nearly as large as the 
one on model selection. We refer to [3, 9, 13, 25, 26, 29, 34, 40, 43, 44, 45]. 
Finally, linear aggregation procedures are discussed in [13, 34, 40]. 

Given the existence of competing aggregation procedures achieving either 
optimal (MS), (C) or (L) bounds, there is an ongoing discussion as to which 
procedure is the best one. Since this cannot be decided by merely comparing 
the optimal bounds, we suggest an alternative solution. We show that all 
three optimal (MS), (C) and (L) bounds can be nearly achieved via a single 
aggregation procedure. We also show that this procedure leads to near op- 
timal bounds for the newly introduced (S) aggregation, for any subset size 
D. Our answer will thus meet the desiderata of both model (subset) selection
and model averaging. The procedures that we suggest for aggregation
are based on penalized least squares, with the BIC-type or Lasso ($\ell_1$-type)
penalties.

The paper is organized as follows. Section 2 introduces notation and
assumptions used throughout the paper. In Section 3 we show that a BIC-type
aggregate satisfies inequalities of the form (1.4) with the optimal remainder
term $\psi_{n,M}$. We establish the oracle inequalities for all four sets $H^M$ under
consideration, hence showing that the BIC-type aggregate achieves
simultaneously the (S) [and hence the (L)], the (C) and the (MS) bounds. In
Section 4 we study aggregation with the $\ell_1$ penalty and we obtain (1.4)
simultaneously for the (S), (C) and (MS) cases, with a remainder term $\Delta_{n,M}$
that differs from the optimal $\psi_{n,M}$ only by a logarithmic factor. We give the
corresponding lower bounds for (S), (C) and (MS) aggregation in Section 5,
complementing the results obtained for the random design case in [40]. All
proofs are deferred to the appendices.

2. Notation and assumptions. The following two assumptions on the
regression model (1.1) are supposed to be satisfied throughout the paper.

Assumption (A1). The random variables $W_i$ are independent and Gaussian
$N(0,\sigma^2)$.

Assumption (A2). The functions $f:\mathcal{X}\to\mathbb{R}$ and $f_j:\mathcal{X}\to\mathbb{R}$, $j = 1,\dots,M$,
with $M\ge 2$, belong to the class $\mathcal{F}_0$ of uniformly bounded functions
defined by

$\mathcal{F}_0 = \left\{g:\mathcal{X}\to\mathbb{R} \;\Big|\; \sup_{x\in\mathcal{X}}|g(x)|\le L\right\},$

where $L<\infty$ is a constant that is not necessarily known to the statistician.

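The functions $f_j$ can be viewed as estimators of $f$ constructed from a
training sample. Here we consider the ideal situation in which they are fixed;
we concentrate on learning only. For each $\lambda = (\lambda_1,\dots,\lambda_M)\in\mathbb{R}^M$, define

$\mathsf{f}_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x)$

and let $M(\lambda)$ denote the number of nonzero coordinates of $\lambda$, that is,

$M(\lambda) = \sum_{j=1}^M I_{\{\lambda_j\ne 0\}} = \operatorname{Card} J(\lambda),$

where $I_{\{\cdot\}}$ denotes the indicator function and $J(\lambda) = \{j\in\{1,\dots,M\} : \lambda_j\ne 0\}$.
Furthermore, we introduce the residual sum of squares

$S(\lambda) = \frac{1}{n}\sum_{i=1}^n \{Y_i - \mathsf{f}_\lambda(X_i)\}^2$

and the function

$L(\lambda) = 2\log\left(\frac{eM}{M(\lambda)\vee 1}\right)$

for all $\lambda\in\mathbb{R}^M$. The method that we propose is based on aggregating the
$f_j$'s via penalized least squares. Given a penalty term $\mathrm{pen}(\lambda)$, the penalized
least squares estimator $\hat\lambda = (\hat\lambda_1,\dots,\hat\lambda_M)$ is defined by

(2.1) $\hat\lambda = \arg\min_{\lambda\in\mathbb{R}^M}\{S(\lambda) + \mathrm{pen}(\lambda)\},$

which renders in turn the aggregated estimator

(2.2) $\tilde f(x) = \mathsf{f}_{\hat\lambda}(x).$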
Although $\hat\lambda$ is not necessarily unique, all our oracle inequalities hold for any
minimizer (2.1). Since the vector $\hat\lambda$ can take any values in $\mathbb{R}^M$, the aggregate
$\tilde f$ is not a model selector in the traditional sense, nor is it necessarily a convex
combination of the functions $f_j$. Nevertheless, we will show that it mimics
the (S), (C) and (MS) oracles when one of the following two penalties is
used:

(2.3) $\mathrm{pen}(\lambda) = \frac{2\sigma^2}{n}\left\{1 + \frac{2+a}{1+a}\sqrt{L(\lambda)} + \frac{1+a}{a}L(\lambda)\right\}M(\lambda)$

or

(2.4) $\mathrm{pen}(\lambda) = 2\sqrt{2}\,\sigma\sqrt{\frac{\log M + \log n}{n}}\sum_{j=1}^M |\lambda_j|\,\|f_j\|_n.$

In (2.3), $a>0$ is a parameter to be set by the user. The penalty (2.3) can be
viewed as a variant of BIC-type penalties [21, 38] since $\mathrm{pen}(\lambda)\sim M(\lambda)$, but
the scaling factor here is different and depends on $M(\lambda)$. We also note that
in the sequence space model (where the functions $f_1,\dots,f_M$ are orthonormal
with respect to the scalar product induced by the norm $\|\cdot\|_n$), the penalty
$\mathrm{pen}(\lambda)\sim M(\lambda)$ leads to $\hat\lambda_j$'s that are hard thresholded values of the $Y_j$'s
(see, e.g., [24], page 138). Our penalty (2.3) is not exactly of that form, but
it differs from it only in a logarithmic factor.

The penalty (2.4), again in the sequence space model, leads to $\hat\lambda_j$'s that
are soft thresholded values of $Y_j$'s. In general models, (2.4) is a weighted
$\ell_1$-penalty, with data-dependent weights. Penalized least squares estimators
with $\ell_1$-penalty $\mathrm{pen}(\lambda)\sim\sum_{j=1}^M|\lambda_j|$ are closely related to basis pursuit [17],
to Lasso-type estimators [2, 26, 34, 39] and LARS [20].

Our results show that the BIC-type penalty (2.3) allows optimal
aggregation under (A1) and (A2). The $\ell_1$-penalty (2.4) allows near optimal
aggregation under somewhat stronger conditions.
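
The two penalties can be evaluated numerically as in the sketch below. It assumes that the dictionary is stored as an $n\times M$ array `F` with `F[i, j]` equal to $f_j(X_i)$, and that $\sigma^2$ is known; the function names and the array layout are illustrative choices, not notation from the paper.

```python
import numpy as np

def bic_type_penalty(lam, sigma2, n, a=1.0):
    """Penalty (2.3): depends on lam only through M(lam), its number of nonzeros."""
    lam = np.asarray(lam, dtype=float)
    M = lam.size
    m = int(np.count_nonzero(lam))                 # M(lambda)
    L = 2.0 * np.log(np.e * M / max(m, 1))         # L(lambda) = 2 log(eM / (M(lambda) v 1))
    scale = 1.0 + (2.0 + a) / (1.0 + a) * np.sqrt(L) + (1.0 + a) / a * L
    return 2.0 * sigma2 / n * scale * m

def l1_type_penalty(lam, F, sigma):
    """Penalty (2.4): weighted l1 norm with weights proportional to ||f_j||_n."""
    lam = np.asarray(lam, dtype=float)
    n, M = F.shape
    norms = np.sqrt(np.mean(F ** 2, axis=0))       # empirical norms ||f_j||_n
    r_n = 2.0 * np.sqrt(2.0) * sigma * np.sqrt((np.log(M) + np.log(n)) / n)
    return r_n * np.sum(np.abs(lam) * norms)
```

Note that `bic_type_penalty` vanishes at $\lambda = 0$ and is insensitive to the magnitudes of the nonzero coordinates, whereas `l1_type_penalty` scales linearly in $\lambda$.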



3. Optimal aggregation with a BIC-type penalty. In this section we show
that the penalized least squares aggregate (2.2) corresponding to the penalty
term (2.3) achieves simultaneously the (L), (C) and (MS) bounds of the form
(1.4) with the optimal rates $\Delta_{n,M} = \psi_{n,M}$. Consequently, the smallest bound
is achieved by our aggregate. The next theorem presents an oracle inequality
that implies all three bounds, as well as a bound for (S) aggregation.

Theorem 3.1. Assume (A1) and (A2). Let $\tilde f$ be the penalized least
squares aggregate defined in (2.2) with penalty (2.3). Then, for all $a>0$ and
all integers $n\ge 1$ and $M\ge 2$,

(3.1) $E_f\|\tilde f - f\|_n^2 \le (1+a)\inf_{\lambda\in\mathbb{R}^M}\left[\|\mathsf{f}_\lambda - f\|_n^2 + \frac{\sigma^2}{n}\left\{5 + \frac{2+3a}{a}L(\lambda)\right\}M(\lambda)\right] + \frac{\sigma^2}{n}\,\frac{6(1+a)^2}{a(e-1)}.$

The proof is given in Appendix A.
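
For small $M$ the minimizer in (2.1) with penalty (2.3) can be computed by brute force, because the penalty depends on $\lambda$ only through $M(\lambda)$: for every support one solves an ordinary least squares problem and adds the penalty for that support size. The sketch below is an illustration under assumptions (known $\sigma^2$, dictionary stored as an $n\times M$ array `F`, hypothetical function name `bic_aggregate`); its cost grows like $2^M$, so it is not meant as a practical algorithm.

```python
import itertools
import numpy as np

def bic_aggregate(F, Y, sigma2, a=1.0):
    """Exhaustive-search minimizer of S(lam) + pen(lam) with the penalty (2.3)."""
    n, M = F.shape

    def pen(m):
        # Penalty (2.3) for a vector with m nonzero coordinates.
        if m == 0:
            return 0.0
        L = 2.0 * np.log(np.e * M / m)
        return 2.0 * sigma2 / n * (1.0 + (2.0 + a) / (1.0 + a) * np.sqrt(L)
                                   + (1.0 + a) / a * L) * m

    best_lam = np.zeros(M)
    best_val = float(np.mean(Y ** 2))              # criterion at lam = 0
    for m in range(1, M + 1):
        for support in itertools.combinations(range(M), m):
            cols = list(support)
            coef, *_ = np.linalg.lstsq(F[:, cols], Y, rcond=None)
            val = float(np.mean((Y - F[:, cols] @ coef) ** 2)) + pen(m)
            if val < best_val:
                best_val = val
                best_lam = np.zeros(M)
                best_lam[cols] = coef
    return best_lam, best_val
```

Charging `pen(m)` for a support of size `m` is harmless even if a fitted coefficient happens to be zero, since the same vector is also visited with the smaller support and the penalty is increasing in `m` for `m >= 1`.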

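Corollary 3.2. Under the conditions of Theorem 3.1, there exists a
constant $C>0$ such that for all $a>0$ and all integers $n\ge 1$ and $M\ge 2$ and
$D\le M$, the following upper bounds for $R_{M,n} = E_f\|\tilde f - f\|_n^2$ hold:

(3.2) $R_{M,n} \le (1+a)\inf_{1\le j\le M}\|f_j - f\|_n^2 + C(1+a+a^{-1})\,\sigma^2\,\frac{\log M}{n},$

(3.3) $R_{M,n} \le (1+a)\inf_{\lambda\in\Lambda^{M,D}}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+a+a^{-1})\,\sigma^2\,\frac{D\log(M/D+1)}{n},$

(3.4) $R_{M,n} \le (1+a)\inf_{\lambda\in\mathbb{R}^M}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+a+a^{-1})\,\sigma^2\,\frac{M}{n},$

(3.5) $R_{M,n} \le (1+a)\inf_{\lambda\in\Lambda^M}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+a+a^{-1})(L^2+\sigma^2)\,\psi_n^C(M),$

where

$\psi_n^C(M) = \begin{cases} M/n, & \text{if } M\le\sqrt{n},\\ \sqrt{\{\log(eM/\sqrt{n})\}/n}, & \text{if } M>\sqrt{n}. \end{cases}$

The proof is given in Appendix A.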

Note that, along with the bounds of Corollary 3.2, Theorem 3.1 implies a
trivial (constant) bound on $R_{M,n}$. In fact, the infimum over $\lambda\in\mathbb{R}^M$ in (3.1)
is smaller than the value of the expression in square brackets for $\lambda = 0$, which
together with Assumption (A2) yields $R_{M,n} \le (1+a)L^2 + \frac{\sigma^2}{n}\frac{6(1+a)^2}{a(e-1)}$.
Therefore, the remainder terms in (3.2)-(3.5) can be replaced by their truncated
versions (truncation by a constant).

A variant of Theorem 3.1 for regression with random design $X_1,\dots,X_n$
can be found in [14].

Remark 1. The variance $\sigma^2$ is usually unknown and we need to substitute
an estimate in the penalty (2.3). We consider the situation described in
the introduction where the functions $f_j$ are estimators based on an independent
(training) data set $\mathcal{D}'_\ell$ that consists of observations $(X'_j,Y'_j)$ following
(1.1). Let $\hat\sigma^2$ be an estimate of $\sigma^2$ based on $\mathcal{D}'_\ell$ only. We write $E_f^{(1)}$ ($E_f^{(2)}$) for
expectation with respect to $\mathcal{D}'_\ell$ ($\mathcal{D}_n$), and let $E_f = E_f^{(1)}E_f^{(2)}$ be the product
expectation. Let $\tilde f$ be the aggregate corresponding to the penalty (2.3) with
$\sigma^2$ replaced by $\hat\sigma^2$. Note that

$E_f\|\tilde f - f\|_n^2 = E_f^{(1)}E_f^{(2)}\|\tilde f - f\|_n^2 I_{\{2\hat\sigma^2\ge\sigma^2\}} + E_f^{(1)}E_f^{(2)}\|\tilde f - f\|_n^2 I_{\{2\hat\sigma^2<\sigma^2\}}.$

Inspection of the proof of Theorem 3.1 shows that $E_f^{(2)}\|\tilde f - f\|_n^2 I_{\{2\hat\sigma^2\ge\sigma^2\}}$
can be bounded simply by the right-hand side of (3.1) with $\sigma^2$ substituted
by $2\hat\sigma^2$, as Theorem 3.1 holds for any penalty term larger than (2.3).
Consequently we find

$E_f\|\tilde f - f\|_n^2 I_{\{2\hat\sigma^2\ge\sigma^2\}} \le \frac{2E_f^{(1)}\hat\sigma^2}{n}\,\frac{6(1+a)^2}{a(e-1)} + (1+a)\inf_{\lambda\in\mathbb{R}^M}\left[E_f^{(1)}\|\mathsf{f}_\lambda - f\|_n^2 + \frac{2E_f^{(1)}\hat\sigma^2}{n}\left\{5 + \frac{2+3a}{a}L(\lambda)\right\}M(\lambda)\right].$

Next, we observe that $E_f^{(2)}\|\tilde f - f\|_n^2 \le 6\sigma^2 + 2L^2$. For this, we use the
reasoning leading to (A.5) in the proof of Theorem 4.1, in which we replace $I_{A^c}$
by 1 throughout. Notice that this argument holds for any positive penalty
term $\mathrm{pen}(\lambda)$ such that $\mathrm{pen}(\lambda_0) = 0$ with $\lambda_0 = (0,\dots,0)$, and hence it holds
for the penalty term used here. Thus

$E_f\|\tilde f - f\|_n^2 I_{\{2\hat\sigma^2<\sigma^2\}} \le (6\sigma^2 + 2L^2)\,P_f^{(1)}\{2\hat\sigma^2<\sigma^2\}.$

Combining the three displays above we see that $\tilde f$ achieves a bound similar
to (3.1) if the estimator $\hat\sigma^2$ satisfies $P_f^{(1)}\{2\hat\sigma^2<\sigma^2\}\le c_1/n$ and $E_f^{(1)}\hat\sigma^2\le c_2\sigma^2$
for some finite constants $c_1$, $c_2$. Since the sample variance of $Y'_i$ from the
training sample $\mathcal{D}'_\ell$, with $\ell\ge cn$ for some positive constant $c$, meets both
requirements, it can always play the role of $\hat\sigma^2$.

4. Near optimal aggregation with a data-dependent $\ell_1$ penalty. In this
section we show that the penalized least squares aggregate (2.2) using a
penalty of the form (2.4) achieves simultaneously the (MS), (C), (L) and
(S) bounds of the form (1.4) with near optimal rates $\Delta_{n,M}$. We will use the
following additional assumption.

Assumption (A3). Define the matrices

$\Psi_n = \left(\frac{1}{n}\sum_{i=1}^n f_j(X_i)f_{j'}(X_i)\right)_{1\le j,j'\le M},$

$\operatorname{diag}(\Psi_n) = \operatorname{diag}(\|f_1\|_n^2,\dots,\|f_M\|_n^2).$

There exists $\kappa = \kappa_{n,M}>0$ such that the matrix $\Psi_n - \kappa\operatorname{diag}(\Psi_n)$ is positive
semidefinite for any given $n\ge 1$, $M\ge 2$.
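
Assumption (A3) can be checked numerically for a given design. The sketch below assumes that all empirical norms $\|f_j\|_n$ are strictly positive; in that case $\Psi_n - \kappa\operatorname{diag}(\Psi_n)$ is positive semidefinite exactly when $\kappa$ does not exceed the smallest eigenvalue of the normalized Gram matrix with entries $\langle f_j,f_{j'}\rangle_n/(\|f_j\|_n\|f_{j'}\|_n)$. This reduction is a standard linear algebra fact used here for illustration, and the function name `largest_kappa` is hypothetical.

```python
import numpy as np

def largest_kappa(F):
    """Largest kappa with Psi_n - kappa * diag(Psi_n) positive semidefinite.

    F is an n x M array with F[i, j] = f_j(X_i); assumes every ||f_j||_n > 0.
    A return value of 0 means that (A3), which requires kappa > 0, fails.
    """
    n, M = F.shape
    Psi = F.T @ F / n                              # Gram matrix Psi_n
    d = np.diag(Psi)                               # squared norms ||f_j||_n^2
    if np.any(d <= 0):
        raise ValueError("this sketch assumes nonzero empirical norms")
    C = Psi / np.sqrt(np.outer(d, d))              # normalized Gram matrix
    return max(float(np.linalg.eigvalsh(C)[0]), 0.0)
```

For mutually orthogonal $f_j$ the function returns 1, and for a dictionary containing two proportional functions it returns (numerically) 0.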
The next theorem presents an oracle inequality similar to the one of
Theorem 3.1.

Theorem 4.1. Assume (A1), (A2) and (A3). Let $\tilde f$ be the penalized
least squares aggregate defined by (2.2) with penalty (2.4). Then, for all
$\varepsilon>0$, and all integers $n\ge 1$, $M\ge 2$, we have

(4.1) $E_f\|\tilde f - f\|_n^2 \le \inf_{\lambda\in\mathbb{R}^M}\left\{(1+\varepsilon)\|\mathsf{f}_\lambda - f\|_n^2 + 8\left(4+\varepsilon+\frac{4}{\varepsilon}\right)\frac{\sigma^2}{\kappa}\,\frac{\log M+\log n}{n}\,M(\lambda)\right\} + \frac{4L^2+12\sigma^2}{n\sqrt{\pi(\log M+\log n)}} + 6\sigma^2\sqrt{\frac{n+2}{n}}\exp\left(-\frac{n}{16}\right).$

The proof is given in Appendix A.

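Corollary 4.2. Let the assumptions of Theorem 4.1 be satisfied. Then
there exists a constant $C = C(L^2,\sigma^2,\kappa)>0$ such that for all $\varepsilon>0$ and for
all integers $n\ge 1$, $M\ge 2$ and $D\le M$,

$E_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{1\le j\le M}\|f_j - f\|_n^2 + C(1+\varepsilon+\varepsilon^{-1})\,\frac{\log(M\vee n)}{n},$

$E_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in\Lambda^{M,D}}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+\varepsilon+\varepsilon^{-1})\,\frac{D\log(M\vee n)}{n},$

$E_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in\mathbb{R}^M}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+\varepsilon+\varepsilon^{-1})\,\frac{M\log(M\vee n)}{n},$

$E_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in\Lambda^M}\|\mathsf{f}_\lambda - f\|_n^2 + C(1+\varepsilon+\varepsilon^{-1})\,\bar\psi_n^C(M),$

where

$\bar\psi_n^C(M) = \begin{cases} (M\log n)/n, & \text{if } M\le\sqrt{n},\\ \sqrt{(\log M)/n}, & \text{if } M>\sqrt{n}. \end{cases}$

Proof. The argument is similar to that of the proof of Corollary 3.2. □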

Remark 2. Using the same reasoning as in Remark 1, we can replace
$\sigma^2$ in the penalty term by twice the sample variance of $Y'_i$ from the training
sample $\mathcal{D}'_\ell$.



Remark 3. Inspection of the proofs shows that the constants $C =
C(L^2,\sigma^2,\kappa)$ in Corollary 4.2 have the form $C = A_1 + A_2/\kappa$, where $A_1$ and
$A_2$ are constants independent of $\kappa$. In general, $\kappa$ may depend on $n$ and $M$.
However, if $\kappa>c$ for some constant $c>0$, independent of $n$ and $M$, as
discussed in Remarks 4 and 5 below, the rates of aggregation given in Corollary
4.2 are near optimal, up to logarithmic factors. They are exactly optimal
[cf. (1.3) and the lower bounds of the next section] for some configurations
of $n$, $M$: for (MS)-aggregation if $n^{a'}\le M\le n^{a}$, and for (C)-aggregation if
$n^{1/2}\le M\le n^{a}$, where $0<a'<a<\infty$.

Remark 4. If $\xi_{\min}$, the smallest eigenvalue of the matrix $\Psi_n$, is positive,
Assumption (A3) is satisfied for $\kappa = \xi_{\min}/L^2$. In a standard parametric
regression context where $M$ is fixed and $\Psi_n/n$ converges to a nonsingular
$M\times M$ matrix, we have that $\xi_{\min}>c$ (and therefore $\kappa>c/L^2$) for $n$ large
enough and for some $c>0$ independent of $M$ and $n$.

Remark 5. Assumption (A3) is trivially satisfied with $\kappa = 1$ if $\Psi_n$ is
a diagonal matrix (note that zero diagonal entries are not excluded). An
example illustrating this situation is related to the orthogonal series
nonparametric regression: $M = M_n$ is allowed to converge to $\infty$ as $n\to\infty$ and
the basis functions $f_j$ are orthogonal with respect to the empirical norm.
Another example is related to sequence space models, where the $f_j$ are
estimators constructed from nonintersecting blocks of coefficients. Aggregation
of such mutually orthogonal estimators can be used to achieve adaptation
(cf., e.g., [34]). Note that Assumption (A3) does not exclude the matrices
$\Psi_n$ whose ordered eigenvalues can be arbitrarily close to 0 as $M\to\infty$. The
last property is characteristic for sequence space representation of statistical
inverse problems: there $\Psi_n$ is diagonal, with $M = M_n\to\infty$, as $n\to\infty$,
and with the $j$th eigenvalue tending to 0 as $j\to\infty$. For such matrices $\Psi_n$
Assumption (A3) holds with $\kappa = 1$, so that the oracle inequality of Theorem
4.1 is invariant with respect to the speed of decrease of the eigenvalues.

Remark 6. The bounds of Corollary 4.2 can be written with the re- 
mainder terms truncated at a constant level (cf. an analogous remark after 
Corollary 3.2). Thus, for M > n the (L) bounds become trivial. 

However, for $M>n$ an oracle bound of the type (4.1) is still meaningful
if $f$ is sparse, that is, can be well approximated by relatively few (less than
$n$) functions $f_j$. This is illustrated by the next theorem where Assumption
(A3) is replaced by a local mutual coherence property of the matrix $\Psi_n$,
relaxing the mutual coherence condition suggested in [19]. Let

$\rho_n(i,j) = \frac{\langle f_i,f_j\rangle_n}{\|f_i\|_n\,\|f_j\|_n}$

denote the "correlation" between two elements $f_i$ and $f_j$. We will assume
that the values $\rho_n(i,j)$ with $i\ne j$ are relatively small, for $i\in J(\lambda)$, $\lambda\in\mathbb{R}^M$.
Set

$\rho(\lambda) = \max_{i\in J(\lambda)}\max_{j>i}|\rho_n(i,j)|.$



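Theorem 4.3. Assume (A1) and (A2). Let $\tilde f$ be the penalized least
squares aggregate defined by (2.2) with penalty

$\mathrm{pen}(\lambda) = 4\sqrt{2}\,\sigma\sqrt{\frac{\log M+\log n}{n}}\sum_{j=1}^M|\lambda_j|\,\|f_j\|_n.$

Then, for all $\varepsilon>0$ and all integers $n\ge 1$, $M\ge 2$, we have

$E_f\|\tilde f - f\|_n^2 \le \inf\left\{(1+\varepsilon)\|\mathsf{f}_\lambda - f\|_n^2 + 32\left(4+\varepsilon+\frac{4}{\varepsilon}\right)\sigma^2\,\frac{\log M+\log n}{n}\,M(\lambda)\right\} + \frac{4L^2+12\sigma^2}{n\sqrt{\pi(\log M+\log n)}} + 6\sigma^2\sqrt{\frac{n+2}{n}}\exp\left(-\frac{n}{16}\right),$

where the infimum is taken over all $\lambda\in\mathbb{R}^M$ such that $32\rho(\lambda)M(\lambda)\le 1$.
The proof is given in Appendix A.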

In particular, if $f$ has a sparse representation $f = \mathsf{f}_{\lambda^*}$ for some $\lambda^*\in\mathbb{R}^M$
with $32\rho(\lambda^*)M(\lambda^*)\le 1$, there exists a constant $C = C(L^2,\sigma^2)<\infty$ such
that

$E_f\|\tilde f - f\|_n^2 \le C(\log M+\log n)\,\frac{M(\lambda^*)}{n}$

for all $n\ge 1$ and $M\ge 2$. Even for $M>n$, this bound is meaningful if
$M(\lambda^*)\ll n$.

Note that in Theorem 4.3 the correlations $\rho_n(i,j)$ with $i,j\notin J(\lambda)$ can
take arbitrary values in $[-1,1]$. Such $\rho_n(i,j)$ constitute the overwhelming
majority of the elements of the correlation matrix if $J(\lambda)$ is a set of small
cardinality, $M(\lambda)\ll M$.
Remark 7. An attractive feature of the $\ell_1$-penalized aggregation is its
computational feasibility. Clearly, the criterion in (2.1) with penalties as in 
Theorems 4.1 and 4.3 is convex in A. One can therefore use techniques of 
convex optimization to compute the aggregates. We refer, for instance, to [20, 
35] for detailed analysis of such optimization problems and fast algorithms. 
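
As a simple illustration of this computational feasibility, the criterion in (2.1) with the weighted $\ell_1$ penalty can be minimized by cyclic coordinate descent, where each coordinate update is a soft-thresholding step. This is a generic convex optimization sketch, not one of the algorithms analyzed in [20, 35]; the function names, the fixed number of sweeps and the `weight_factor` argument (1 for the penalty (2.4), 2 for the penalty of Theorem 4.3) are assumptions made for the example, and $\sigma$ is taken as known.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft thresholding of the scalar z at level t >= 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def l1_aggregate(F, Y, sigma, n_sweeps=200, weight_factor=1.0):
    """Coordinate descent for S(lam) + sum_j w_j |lam_j|, w_j = weight_factor * r_n * ||f_j||_n."""
    n, M = F.shape
    sq_norms = np.mean(F ** 2, axis=0)             # ||f_j||_n^2
    r_n = 2.0 * np.sqrt(2.0) * sigma * np.sqrt((np.log(M) + np.log(n)) / n)
    w = weight_factor * r_n * np.sqrt(sq_norms)
    lam = np.zeros(M)
    resid = np.asarray(Y, dtype=float).copy()      # Y - F @ lam
    for _ in range(n_sweeps):
        for j in range(M):
            if sq_norms[j] == 0.0:
                continue                           # f_j vanishes at the design points
            resid += F[:, j] * lam[j]              # remove coordinate j from the fit
            z = F[:, j] @ resid / n                # <f_j, partial residual>_n
            lam[j] = soft_threshold(z, w[j] / 2.0) / sq_norms[j]
            resid -= F[:, j] * lam[j]
    return lam
```

When the $f_j$ are orthonormal with respect to $\|\cdot\|_n$, a single sweep already returns the soft thresholded values of the empirical coefficients, in agreement with the discussion of the penalty (2.4) in Section 2.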

Remark 8. We refer to Theorem 2.1 in [35] for conditions under which 
the penalized least squares aggregate is unique. Typically, for M > n the 
solution is not unique, but a convex combination of solutions is itself a 
solution. Our results hold for any element of such a convex set of solutions. 
5. Lower bounds. In this section we provide lower bounds showing that
the remainder terms in the upper bounds obtained in the previous sections
are optimal or near optimal. For regression with random design and the
$L_2(\mathbb{R}^d,\mu)$-risks, such lower bounds for aggregation with optimal rates $\psi_{n,M}$
as given in (1.3) were established in [40]. The next theorem extends them
to aggregation for the regression model with fixed design. Furthermore, we
state these bounds in a more general form, considering not only the expected
squared risks, but also other loss functions, and instead of the (L)
aggregation lower bound, we provide the more general (S) aggregation bound.

Let $w:\mathbb{R}\to[0,\infty)$ be a loss function, that is, a monotone nondecreasing
function satisfying $w(0) = 0$ and $w\not\equiv 0$.

Theorem 5.1. Let the integers $n$, $M$, $D$ be such that $2\le M\le n$, and
let $X_1,\dots,X_n$ be distinct points. Assume that $H^M$ is either the simplex $\Lambda^M$
[for the (C) aggregation case], the set of vertices of $\Lambda^M$, except the vertex
$(0,\dots,0)\in\mathbb{R}^M$ [for the (MS) aggregation case], or the set $\Lambda^{M,D}$ [for the (S)
aggregation case]. Let the corresponding $\psi_{n,M}$ be given by (1.3) and, for (S)
aggregation, assume that $M\log(M/D+1)\le n$ and $M\ge D$. Then there exist
$f_1,\dots,f_M\in\mathcal{F}_0$ such that

(5.1) $\inf_{T_n}\sup_{f\in\mathcal{F}_0} E_f\,w\left[\psi_{n,M}^{-1}\left(\|T_n - f\|_n^2 - \inf_{\lambda\in H^M}\|\mathsf{f}_\lambda - f\|_n^2\right)\right] \ge c,$

where $\inf_{T_n}$ denotes the infimum over all estimators and the constant $c>0$
does not depend on $n$, $M$ and $D$.

The proof is given in Appendix A. 

Setting w(u) = u in Theorem 5.1, we get the lower bounds for expected 
squared risks showing optimality or near optimality of the remainder terms 
in the oracle inequalities of Corollaries 3.2 and 4.2. The choice of w(u) = 
$I\{u>a\}$ with some fixed $a>0$ leads to the lower bounds for probabilities
showing near optimality of the remainder terms in the corresponding upper 
bounds "in probability" obtained in [14]. 

APPENDIX A: PROOFS 

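A.1. Proof of Theorem 3.1. Let $\Lambda_m$ be the set of $\lambda\in\mathbb{R}^M$ with exactly $m$
nonzero coefficients, $\Lambda_m = \{\lambda\in\mathbb{R}^M : M(\lambda) = m\}$. Let $J_{m,k}$, $k = 1,\dots,\binom{M}{m}$,
be all the subsets of $\{1,\dots,M\}$ of cardinality $m$ and define

$\Lambda_{m,k} = \{\lambda = (\lambda_1,\dots,\lambda_M)\in\Lambda_m : \lambda_j\ne 0 \Leftrightarrow j\in J_{m,k}\}.$

The collection $\{\Lambda_{m,k} : 1\le k\le\binom{M}{m}\}$ forms a partition of the set $\Lambda_m$. Observe
that

$\inf_{\lambda\in\mathbb{R}^M}\{S(\lambda)+\mathrm{pen}(\lambda)\} = \inf_{0\le m\le M}\;\inf_{1\le k\le\binom{M}{m}}\;\inf_{\lambda\in\Lambda_{m,k}}\{S(\lambda)+\mathrm{pen}(\lambda)\}.$

Here the penalty $\mathrm{pen}(\lambda)$ is defined in (2.3), and it takes a constant value
on each of the sets $\Lambda_m$ as $M(\lambda) = m$ and $L(\lambda) = L_m = 2\ln(eM/(m\vee 1))$ for
all $\lambda\in\Lambda_m$. We now apply [11], Theorem 2, choosing there the parameters
$\theta = a/(1+a)$ and $K = 2$. This yields

$E_f\|\tilde f - f\|_n^2 \le (1+a)\inf_{0\le m\le M}\;\inf_{1\le k\le\binom{M}{m}}\left\{\inf_{\lambda\in\Lambda_{m,k}}\|\mathsf{f}_\lambda - f\|_n^2 + \mathrm{pen}(\lambda) - \frac{m\sigma^2}{n}\right\} + \frac{(1+a)^2}{a}\,\frac{\sigma^2}{n}\left\{\left(\frac{2+a}{1+a}\right)^2 + 2\right\}\Sigma,$

where $\Sigma = \sum_{m=1}^M\binom{M}{m}\exp(-mL_m)$. Using the crude bound $\binom{M}{m}\le(eM/m)^m$
(see, e.g., [18], page 218), we get

$\Sigma \le \sum_{m=1}^M\left(\frac{eM}{m}\right)^{-m} \le \sum_{m=1}^M e^{-m} \le \frac{1}{e-1}.$

For all $\lambda\in\Lambda_m$, we have

$n\,\mathrm{pen}(\lambda) - m\sigma^2 = \sigma^2 m\left\{1 + 2\,\frac{2+a}{1+a}\sqrt{L_m} + 2\,\frac{1+a}{a}L_m\right\} \le \sigma^2 m\left\{5 + \frac{2+3a}{a}L_m\right\}.$

Consequently we find

$E_f\|\tilde f - f\|_n^2 \le (1+a)\inf_{0\le m\le M}\;\inf_{1\le k\le\binom{M}{m}}\left\{\inf_{\lambda\in\Lambda_{m,k}}\|f - \mathsf{f}_\lambda\|_n^2 + \frac{\sigma^2 m}{n}\left(5 + \frac{2+3a}{a}L_m\right)\right\} + \frac{6(1+a)^2}{a(e-1)}\,\frac{\sigma^2}{n}$

$\qquad = (1+a)\inf_{\lambda\in\mathbb{R}^M}\left[\|\mathsf{f}_\lambda - f\|_n^2 + \frac{\sigma^2}{n}\left\{5 + \frac{2+3a}{a}L(\lambda)\right\}M(\lambda)\right] + \frac{6(1+a)^2}{a(e-1)}\,\frac{\sigma^2}{n},$

which proves the result.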

A.2. Proof of Corollary 3.2. 

Proof of (3.2). Since the infimum on the right-hand side of (3.1) is
taken over all $\lambda\in\mathbb{R}^M$, the bound easily follows by restricting the
minimization to the set of the $M$ vertices $(1,0,\dots,0)$, $(0,1,0,\dots,0)$, $\dots$, $(0,\dots,0,1)$
of $\Lambda^M$. □

Proof of (3.3) and (3.4). The (S) bound (3.3) easily follows from
(3.1) by restricting the minimization to $\Lambda^{M,D}$. In fact, for $\lambda\in\Lambda^{M,D}$ we have
$M(\lambda)\le D$ and, since $x\mapsto x\log(eM/x)$ is increasing for $1\le x\le M$,
$M(\lambda)L(\lambda)\le 2D\log(eM/D)\le 6D\log(M/D+1)$. The (L) bound (3.4)
is a special case of (3.3) for $D = M$. □



Proof of (3.5). For $M\le\sqrt{n}$ the result follows from (3.4). Assume
now that $M>\sqrt{n}$ and let $m\ge 1$ be the smallest integer greater than or
equal to

$x_{n,M} = \sqrt{n}\Big/\left(2\sqrt{\log(eM/\sqrt{n})}\right).$

Clearly, $x_{n,M}\le m\le x_{n,M}+1\le M$. First, consider the case $m>1$. Denote
by $\mathcal{C}$ the set of functions $h$ of the form

$h(x) = \frac{1}{m}\sum_{j=1}^M k_j f_j(x), \quad k_j\in\{0,1,\dots,m\}, \quad \sum_{j=1}^M k_j\le m.$

The following approximation result can be obtained by the "Maurey
argument" (see, e.g., [6], Lemma 1 or [34], pages 192 and 193):

(A.1) $\min_{g\in\mathcal{C}}\|g - f\|_n^2 \le \min_{\lambda\in\Lambda^M}\|\mathsf{f}_\lambda - f\|_n^2 + \frac{L^2}{m}.$

For completeness, we give the proof of (A.1) in Appendix B. Since $M(\lambda)\le
m\le x_{n,M}+1$ for the vectors $\lambda$ corresponding to $g\in\mathcal{C}$, and since $x\mapsto
x\log(eM/x)$ is increasing for $1\le x\le M$, we get from (3.1)

$E_f\|\tilde f - f\|_n^2 \le \inf_{g\in\mathcal{C}}\left\{C_1\|g - f\|_n^2 + C_2\,\frac{x_{n,M}+1}{n}\log\left(\frac{eM}{x_{n,M}+1}\right)\right\} + \frac{C_3}{n}$

with $C_1 = 1+a$, $C_2 = C_2'(1+a+1/a)\sigma^2$ and $C_3 = C_3'(1+a+1/a)\sigma^2$, where
$C_2'>0$ and $C_3'>0$ are absolute constants. Using this inequality, (A.1) and
the fact that $m\ge x_{n,M}$, we obtain

$E_f\|\tilde f - f\|_n^2 \le C_1\inf_{\lambda\in\Lambda^M}\|\mathsf{f}_\lambda - f\|_n^2 + C_1\frac{L^2}{x_{n,M}} + C_2\,\frac{x_{n,M}+1}{n}\log\left(\frac{eM}{x_{n,M}}\right) + \frac{C_3}{n}.$

Since, clearly, $n^{-1}\le\psi_n^C(M)$, to complete the proof of (3.5) it remains to
note that 



in view of the elementary inequality $\log(2y\sqrt{\log y})\le 3\log y$, for all $y\ge e$.
□ 
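A.3. Proof of Theorem 4.1. We begin as in [31]. First we define

$r_n = 2\sqrt{2}\,\sigma\sqrt{\frac{\log M+\log n}{n}}$

and $r_{n,j} = r_n\|f_j\|_n$. By definition, $\tilde f = \mathsf{f}_{\hat\lambda}$ satisfies

$S(\hat\lambda) + \sum_{j=1}^M r_{n,j}|\hat\lambda_j| \le S(\lambda) + \sum_{j=1}^M r_{n,j}|\lambda_j|$

for all $\lambda\in\mathbb{R}^M$, which we may rewrite as

$\|\tilde f - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_j| \le \|\mathsf{f}_\lambda - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\lambda_j| + \frac{2}{n}\sum_{i=1}^n W_i(\tilde f - \mathsf{f}_\lambda)(X_i).$

We define the random variables $V_j = \frac{1}{n}\sum_{i=1}^n f_j(X_i)W_i$, $1\le j\le M$, and the
event

$A = \bigcap_{j=1}^M\{2|V_j|\le r_{n,j}\}.$

The normality Assumption (A1) on $W_i$ implies that $\sqrt{n}\,V_j\sim N(0,\sigma^2\|f_j\|_n^2)$,
$1\le j\le M$. Applying the union bound followed by the standard tail bound
for the $N(0,1)$ distribution, we find

(A.2) $P(A^c) \le \sum_{j=1}^M P\{\sqrt{n}\,|V_j| > \sqrt{n}\,r_{n,j}/2\} \le \frac{1}{n\sqrt{\pi(\log M+\log n)}}.$

Then, on the set $A$, we find

$\frac{2}{n}\sum_{i=1}^n W_i(\tilde f - \mathsf{f}_\lambda)(X_i) = 2\sum_{j=1}^M V_j(\hat\lambda_j - \lambda_j) \le \sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j|$

and therefore, still on the set $A$,

$\|\tilde f - f\|_n^2 \le \|\mathsf{f}_\lambda - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j| + \sum_{j=1}^M r_{n,j}|\lambda_j| - \sum_{j=1}^M r_{n,j}|\hat\lambda_j|.$

Recall that $J(\lambda)$ denotes the set of indices of the nonzero elements of $\lambda$, and
$M(\lambda) = \operatorname{Card} J(\lambda)$. Rewriting the right-hand side of the previous display,
we find, on the set $A$,

(A.3) $\|\tilde f - f\|_n^2 \le \|\mathsf{f}_\lambda - f\|_n^2 + \left(\sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j| - \sum_{j\notin J(\lambda)} r_{n,j}|\hat\lambda_j|\right) + \left(-\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j| + \sum_{j\in J(\lambda)} r_{n,j}|\lambda_j|\right)$

$\qquad \le \|\mathsf{f}_\lambda - f\|_n^2 + 2\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j - \lambda_j|$

by the triangle inequality and the fact that $\lambda_j = 0$ for $j\notin J(\lambda)$. By
Assumption (A3), we have

$\sum_{j\in J(\lambda)} r_{n,j}^2|\hat\lambda_j - \lambda_j|^2 \le \sum_{j=1}^M r_{n,j}^2|\hat\lambda_j - \lambda_j|^2 = r_n^2(\hat\lambda - \lambda)'\operatorname{diag}(\Psi_n)(\hat\lambda - \lambda)$

$\qquad \le r_n^2\kappa^{-1}(\hat\lambda - \lambda)'\Psi_n(\hat\lambda - \lambda) = r_n^2\kappa^{-1}\|\tilde f - \mathsf{f}_\lambda\|_n^2.$

Combining this with the Cauchy-Schwarz and triangle inequalities, we find
further that, on the set $A$,

(A.4) $\|\tilde f - f\|_n^2 \le \|\mathsf{f}_\lambda - f\|_n^2 + 2\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j - \lambda_j| \le \|\mathsf{f}_\lambda - f\|_n^2 + 2r_n\sqrt{M(\lambda)/\kappa}\,(\|\tilde f - f\|_n + \|\mathsf{f}_\lambda - f\|_n).$

Inequality (A.4) is of the simple form $v^2\le c^2 + vb + cb$ with $v = \|\tilde f - f\|_n$,
$b = 2r_n\sqrt{M(\lambda)/\kappa}$ and $c = \|\mathsf{f}_\lambda - f\|_n$. After applying the inequality
$2xy\le x^2/\alpha + \alpha y^2$ ($x,y\in\mathbb{R}$, $\alpha>0$) twice, to $2bc$ and $2bv$, we easily find
$v^2\le v^2/(2\alpha) + \alpha b^2 + (2\alpha+1)/(2\alpha)\,c^2$, whence $v^2\le a/(a-1)\{b^2(a/2) + c^2(a+1)/a\}$
for $a = 2\alpha>1$. Recalling that (A.4) is valid on the set $A$, we now get
that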

Ml/ - /IIW < A sf„{^I l|fA - + ^T)'» M(A) } v " > L 

It remains to bound E/||/ — fW^I^. Writing ||W||£ = n -1 X)ft=i an d using 
the inequality (x + y) 2 < 2x 2 + 2y 2 , we find that 

" fWllA* < 2E f S(f)I A c + 2E / ||^|| 2 / AC . 

Next, since $\mathrm{pen}(\lambda) \ge 0$ and by the definition of $\hat f$, for $\lambda_0 = (0,\dots,0)' \in \mathbb{R}^M$,

$$\begin{aligned}
E_f\hat S(\hat f) I_{A^c} &\le E_f\{\hat S(\hat f) + \mathrm{pen}(\hat\lambda)\} I_{A^c} \le E_f\{\hat S(f_{\lambda_0}) + \mathrm{pen}(\lambda_0)\} I_{A^c} \\
&= E_f\hat S(f_{\lambda_0}) I_{A^c} \le 2L^2 P(A^c) + 2E_f\|W\|_n^2 I_{A^c},
\end{aligned}$$

whence

$$E_f\|\hat f - f\|_n^2 I_{A^c} \le 4L^2 P(A^c) + 6E_f\|W\|_n^2 I_{A^c}. \tag{A.5}$$

In order to bound the last term on the right-hand side, we introduce the event $B = \{\frac{1}{n}\sum_{i=1}^n W_i^2 \le 2\sigma^2\}$. Using Lemma B.2 from Appendix B with $d = n$, we get

$$P\{B^c\} = P\bigl\{Z_n - n > \sqrt{2n}\,\sqrt{n/2}\bigr\} \le \exp(-n/8).$$
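The exponent $n/8$ follows from a direct substitution into Lemma B.2. The sketch below assumes the lemma has the form $P\{Z_d - d \ge x\sqrt{2d}\} \le \exp\bigl(-x^2/(2(1 + x\sqrt{2/d}))\bigr)$, which is the reading of (B.4) that reproduces $\exp(-n/8)$; the function names are hypothetical.

```python
import math

def chi2_deviation_bound(x: float, d: int) -> float:
    """Assumed form of (B.4): bound on P{Z_d - d >= x * sqrt(2 d)}."""
    return math.exp(-x * x / (2.0 * (1.0 + x * math.sqrt(2.0 / d))))

def bound_for_event_b(n: int) -> float:
    """Bound on P(B^c): take d = n and x = sqrt(n / 2), so x * sqrt(2 n) = n."""
    x = math.sqrt(n / 2.0)
    return chi2_deviation_bound(x, n)

if __name__ == "__main__":
    for n in (1, 8, 50):
        # The two printed values agree up to rounding.
        print(n, bound_for_event_b(n), math.exp(-n / 8.0))
```

With $x = \sqrt{n/2}$ one has $x^2 = n/2$ and $x\sqrt{2/n} = 1$, so the exponent is $-(n/2)/4 = -n/8$.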

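Observe further that $E_f\|W\|_n^2 I_{A^c} \le 2\sigma^2 P\{A^c\} + E_f\|W\|_n^2 I_{B^c}$ and by the Cauchy-Schwarz inequality we find

$$E_f\|W\|_n^2 I_{B^c} \le \bigl(E_f\|W\|_n^4\bigr)^{1/2}\exp(-n/16) = \Bigl(\frac{3\sigma^4}{n} + \frac{n-1}{n}\,\sigma^4\Bigr)^{1/2}\exp(-n/16) = \sigma^2\sqrt{\frac{n+2}{n}}\,\exp(-n/16).$$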

Collecting all these bounds, and using the bound (A.2) on $P\{A^c\}$, we obtain

$$\begin{aligned}
E_f\|\hat f - f\|_n^2 I_{A^c} &\le 4L^2 P(A^c) + 6E_f\|W\|_n^2 I_{A^c} \\
&\le \frac{4L^2 + 12\sigma^2}{n\sqrt{\pi(\log M + \log n)}} + 6\sigma^2\sqrt{\frac{n+2}{n}}\,\exp(-n/16).
\end{aligned}$$

The proof of the theorem is complete by taking $\varepsilon = 2/(a-1)$.

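A.4. Proof of Theorem 4.3. First, notice that by definition of $\hat f$ and of the penalty $\mathrm{pen}(\lambda) = 2\sum_{j=1}^M r_{n,j}|\lambda_j|$, on the set $A$,

$$\|\hat f - f\|_n^2 \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j| + 2\sum_{j=1}^M r_{n,j}|\lambda_j| - 2\sum_{j=1}^M r_{n,j}|\hat\lambda_j|.$$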

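Adding $\sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j|$ to both sides of this inequality and arguing as in (A.4), we get that, on the set $A$, for any $\lambda \in \mathbb{R}^M$,

$$\begin{aligned}
\|\hat f - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_j - \lambda_j| &\le \|f_\lambda - f\|_n^2 + 4\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j - \lambda_j| \\
&\le \|f_\lambda - f\|_n^2 + 4\sqrt{M(\lambda)}\Bigl(\sum_{j\in J(\lambda)} r_{n,j}^2|\hat\lambda_j - \lambda_j|^2\Bigr)^{1/2}.
\end{aligned}$$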

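Since $\sum\sum_{i\notin J(\lambda),\, j\notin J(\lambda)} \langle f_i, f_j\rangle_n(\hat\lambda_i - \lambda_i)(\hat\lambda_j - \lambda_j) \ge 0$, we have

$$\begin{aligned}
\sum_{j\in J(\lambda)} r_{n,j}^2|\hat\lambda_j - \lambda_j|^2 &= r_n^2\|\hat f - f_\lambda\|_n^2 - r_n^2\sum_{i\notin J(\lambda)}\sum_{j\notin J(\lambda)} \langle f_i, f_j\rangle_n(\hat\lambda_i - \lambda_i)(\hat\lambda_j - \lambda_j) \\
&\quad - 2r_n^2\sum_{i\notin J(\lambda)}\sum_{j\in J(\lambda)} \langle f_i, f_j\rangle_n(\hat\lambda_i - \lambda_i)(\hat\lambda_j - \lambda_j) \\
&\quad - 2r_n^2\sum\sum_{i,j\in J(\lambda),\, j>i} \langle f_i, f_j\rangle_n(\hat\lambda_i - \lambda_i)(\hat\lambda_j - \lambda_j) \\
&\le r_n^2\|\hat f - f_\lambda\|_n^2 + 2r_n^2\rho(\lambda)\Bigl(\sum_{j=1}^M \|f_j\|_n|\hat\lambda_j - \lambda_j|\Bigr)^2.
\end{aligned}$$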

Recalling that $r_{n,j} = r_n\|f_j\|_n$ and combining the last two displays, for $\lambda \in \mathbb{R}^M$ with $4\sqrt{2\rho(\lambda)M(\lambda)} \le 1$, we obtain, on the set $A$,

$$\|\hat f - f\|_n^2 \le \|f_\lambda - f\|_n^2 + 4r_n\sqrt{M(\lambda)}\,\bigl(\|\hat f - f\|_n + \|f_\lambda - f\|_n\bigr),$$

which is inequality (A.4) with $\kappa = 1/4$. The remainder of the proof now parallels that of Theorem 4.1. □
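The combination of the two displays can be made explicit as follows. This is a sketch of the omitted step, writing $\delta_j = \hat\lambda_j - \lambda_j$ and using $\sqrt{u+v} \le \sqrt{u} + \sqrt{v}$, the identity $r_n\|f_j\|_n = r_{n,j}$, the condition $4\sqrt{2\rho(\lambda)M(\lambda)} \le 1$ and the triangle inequality.

```latex
\begin{align*}
4\sqrt{M(\lambda)}\Bigl(\sum_{j\in J(\lambda)} r_{n,j}^2\delta_j^2\Bigr)^{1/2}
 &\le 4\sqrt{M(\lambda)}\Bigl(r_n^2\|\hat f - f_\lambda\|_n^2
      + 2\rho(\lambda)\Bigl(\sum_{j=1}^M r_{n,j}|\delta_j|\Bigr)^2\Bigr)^{1/2} \\
 &\le 4r_n\sqrt{M(\lambda)}\,\|\hat f - f_\lambda\|_n
      + 4\sqrt{2\rho(\lambda)M(\lambda)}\sum_{j=1}^M r_{n,j}|\delta_j| \\
 &\le 4r_n\sqrt{M(\lambda)}\,\bigl(\|\hat f - f\|_n + \|f_\lambda - f\|_n\bigr)
      + \sum_{j=1}^M r_{n,j}|\delta_j|.
\end{align*}
```

The final sum is the same term that was added to the left-hand side, so it cancels and leaves the stated inequality.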

A.5. Proof of Theorem 5.1. We proceed similarly to [40]. The proof is 
based on the following easy corollary of the Fano lemma (which can be 
obtained, e.g., by combining Theorems 2.2 and 2.5 in [41]). 

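Lemma A.1. Let $w$ be a loss function, $A > 0$ be such that $w(A) > 0$, and let $\mathcal{C}$ be a set of functions on $\mathcal{X}$ of cardinality $N = \operatorname{card}(\mathcal{C}) \ge 2$ such that

$$\|f - g\|_n^2 \ge 4s^2 > 0 \qquad \forall f, g \in \mathcal{C},\ f \ne g,$$

and the Kullback divergences $K(P_f, P_g)$ between the measures $P_f$ and $P_g$ satisfy

$$K(P_f, P_g) \le (1/16)\log N \qquad \forall f, g \in \mathcal{C}.$$

Then for $\psi = s^2/A$ we have

$$\inf_{T_n}\sup_{f\in\mathcal{C}} E_f w\bigl[\psi^{-1}\|T_n - f\|_n^2\bigr] \ge c_1 w(A),$$

where $\inf_{T_n}$ denotes the infimum over all estimators and $c_1 > 0$ is a constant.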

The (S) aggregation case. Pick $M$ disjoint subsets $S_1, \dots, S_M$ of $\{X_1, \dots, X_n\}$, each $S_j$ of cardinality $\log(M/D + 1)$ [w.l.o.g. we assume that $\log(M/D + 1)$ is an integer] and define the functions

$$f_j(x) = \gamma I_{\{x \in S_j\}}, \qquad j = 1, \dots, M,$$

where $\gamma \le L$ is a positive constant to be chosen. Consider the set of functions $\mathcal{V} = \{f_\lambda : \lambda \in \tilde\Lambda^{M,D}\}$ where $\tilde\Lambda^{M,D}$ is the set of all $\lambda \in \mathbb{R}^M$ such that $D$ of the coordinates of $\lambda$ are equal to 1 and the remaining $M - D$ coordinates are zero. Clearly, $\mathcal{V} \subset \mathcal{F}_0$. Thus, it suffices to prove the (S) lower bound of the theorem where the supremum over $f \in \mathcal{F}_0$ is replaced by that over $f \in \mathcal{V}$. Since $\tilde\Lambda^{M,D} \subset \Lambda^{M,D}$, for $f \in \mathcal{V}$ we have $\min_{\lambda\in\Lambda^{M,D}}\|f_\lambda - f\|_n^2 = 0$. Therefore, to finish the proof for the (S) case, it suffices to bound from below the quantity $\inf_{T_n}\sup_{f\in\mathcal{V}} E_f w(\psi_{n,M}^{-1}\|T_n - f\|_n^2)$ where $\psi_{n,M} = D\log(M/D + 1)/n$. This will be done by applying Lemma A.1. In fact, note that for every two functions $f_\lambda$ and $f_{\bar\lambda}$ in $\mathcal{V}$ we have

$$\|f_\lambda - f_{\bar\lambda}\|_n^2 = \frac{\gamma^2\log(M/D + 1)}{n}\,\rho(\lambda, \bar\lambda) \le \frac{2\gamma^2 D\log(M/D + 1)}{n}, \tag{A.6}$$



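where $\rho(\lambda, \bar\lambda) = \sum_{j=1}^M I_{\{\lambda_j \ne \bar\lambda_j\}}$ is the Hamming distance between $\lambda = (\lambda_1, \dots, \lambda_M) \in \tilde\Lambda^{M,D}$ and $\bar\lambda = (\bar\lambda_1, \dots, \bar\lambda_M) \in \tilde\Lambda^{M,D}$. Lemma 4 in [10] (see also [22]) asserts that if $M \ge 6D$ there exists a subset $\Lambda' \subset \tilde\Lambda^{M,D}$ such that, for some constant $c > 0$ independent of $M$ and $D$,

$$\log\operatorname{card}(\Lambda') \ge cD\log\Bigl(\frac{M}{D} + 1\Bigr) \tag{A.7}$$

and

$$\rho(\lambda, \bar\lambda) \ge cD \qquad \forall \lambda, \bar\lambda \in \Lambda',\ \lambda \ne \bar\lambda. \tag{A.8}$$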
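The identity in (A.6) between the empirical distance and the Hamming distance can be illustrated on a toy design. In the sketch below the design points are simply indexed $0, \dots, n-1$, the blocks $S_j$ are consecutive index ranges, and the block size `s` stands in for $\log(M/D + 1)$; these choices and all function names are assumptions made for the illustration.

```python
def hamming(lam, lam_bar):
    """Hamming distance: number of coordinates where the two vectors differ."""
    return sum(1 for a, b in zip(lam, lam_bar) if a != b)

def block_functions(M, s, n, gamma):
    """Values of f_1..f_M on n design points; f_j equals gamma on block S_j."""
    if M * s > n:
        raise ValueError("need M * s <= n to fit M disjoint blocks of size s")
    return [
        [gamma if j * s <= i < (j + 1) * s else 0.0 for i in range(n)]
        for j in range(M)
    ]

def combine(lam, fs):
    """Values of f_lambda = sum_j lambda_j f_j on the design points."""
    n = len(fs[0])
    return [sum(l * f[i] for l, f in zip(lam, fs)) for i in range(n)]

def empirical_sq_dist(u, v):
    """Squared empirical norm ||u - v||_n^2 = (1/n) sum_i (u_i - v_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

if __name__ == "__main__":
    M, D, s, n, gamma = 5, 2, 2, 12, 0.5
    fs = block_functions(M, s, n, gamma)
    lam, lam_bar = (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)  # D ones each
    dist = empirical_sq_dist(combine(lam, fs), combine(lam_bar, fs))
    # Left: computed distance. Right: gamma^2 * s * rho / n from (A.6).
    print(dist, gamma ** 2 * s * hamming(lam, lam_bar) / n)
```

Because two vectors with exactly $D$ ones differ in at most $2D$ coordinates, the distance never exceeds $2\gamma^2 D s/n$.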

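Consider a set of functions $\mathcal{C} = \{f_\lambda : \lambda \in \Lambda'\} \subset \mathcal{V}$. From (A.6) and (A.8), for any two functions $f_\lambda$ and $f_{\bar\lambda}$ in $\mathcal{C}$ we have

$$\|f_\lambda - f_{\bar\lambda}\|_n^2 \ge \frac{c\gamma^2 D\log(M/D + 1)}{n} \stackrel{\mathrm{def}}{=} 4s^2. \tag{A.9}$$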

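Since the $W_j$'s are $N(0, \sigma^2)$ random variables, the Kullback divergence $K(P_{f_\lambda}, P_{f_{\bar\lambda}})$ between $P_{f_\lambda}$ and $P_{f_{\bar\lambda}}$ satisfies

$$K(P_{f_\lambda}, P_{f_{\bar\lambda}}) = \frac{n}{2\sigma^2}\|f_\lambda - f_{\bar\lambda}\|_n^2. \tag{A.10}$$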



In view of (A.6) and (A.10), one can choose $\gamma$ small enough to have

$$K(P_{f_\lambda}, P_{f_{\bar\lambda}}) \le \frac{c}{16}\,D\log\Bigl(\frac{M}{D} + 1\Bigr) \le \frac{1}{16}\log\operatorname{card}(\Lambda') = \frac{1}{16}\log\operatorname{card}(\mathcal{C})$$

for all $\lambda, \bar\lambda \in \Lambda'$. Now, to get the lower bound for the (S) case, it remains to use this inequality together with (A.9) and to apply Lemma A.1. Thus, the (S) lower bound is proved under the assumption that $M \ge 6D$, which is needed to assure (A.7) and (A.8).
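To see how small $\gamma$ has to be, the following worked computation combines the Gaussian form of the Kullback divergence with the Hamming bound $\rho(\lambda, \bar\lambda) \le 2D$ for vectors with exactly $D$ ones. It is a sketch of one admissible choice, not necessarily the one intended by the authors.

```latex
\begin{align*}
K(P_{f_\lambda}, P_{f_{\bar\lambda}})
 &= \sum_{i=1}^n \frac{\bigl(f_\lambda(X_i) - f_{\bar\lambda}(X_i)\bigr)^2}{2\sigma^2}
  = \frac{n}{2\sigma^2}\,\|f_\lambda - f_{\bar\lambda}\|_n^2 \\
 &= \frac{\gamma^2\log(M/D+1)}{2\sigma^2}\,\rho(\lambda, \bar\lambda)
  \le \frac{\gamma^2}{\sigma^2}\,D\log\Bigl(\frac{M}{D} + 1\Bigr).
\end{align*}
```

Hence any $\gamma$ with $\gamma^2 \le c\sigma^2/16$ (and $\gamma \le L$) gives $K \le (c/16)\,D\log(M/D+1)$, which by (A.7) is at most $(1/16)\log\operatorname{card}(\Lambda')$.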

In the remaining case where $D \le M < 6D$ we use another construction. Note that it is enough to prove the result for $\psi_{n,M} = D/n$. We consider separately the cases $D \ge 8$ and $2 \le D < 8$. If $D \ge 8$ we consider the functions $f_j(x) = \gamma I_{\{x = X_j\}}$, $j = 1, \dots, M$, and a finite set of their linear combinations,

$$\mathcal{U} = \Bigl\{ g = \sum_{j=1}^D \omega_j f_j : \omega \in \Omega \Bigr\}, \tag{A.11}$$

where $\Omega$ is the set of all vectors $\omega \in \mathbb{R}^M$ with binary coordinates $\omega_j \in \{0, 1\}$. Since the supports of the $f_j$'s are disjoint, the functions $g \in \mathcal{U}$ are uniformly bounded by $\gamma$, thus $\mathcal{U} \subset \mathcal{F}_0$. Also, $\mathcal{U} \subset \{f_\lambda : \lambda \in \Lambda^{M,D}\}$ since at most the first $D$ functions $f_j$ are included in the linear combination. Clearly, $\min_{\lambda\in\Lambda^{M,D}}\|f_\lambda - f\|_n^2 = 0$ for any $f \in \mathcal{U}$. Therefore it remains to bound from below the quantity $\inf_{T_n}\sup_{f\in\mathcal{U}} E_f w(\psi_{n,M}^{-1}\|T_n - f\|_n^2)$, where $\psi_{n,M} = D/n$. To this end, we apply again Lemma A.1.

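Note that for any $g_1 = \sum_{j=1}^D \omega_j f_j \in \mathcal{U}$ and $g_2 = \sum_{j=1}^D \omega_j' f_j \in \mathcal{U}$ we have

$$\|g_1 - g_2\|_n^2 = \frac{\gamma^2}{n}\sum_{j=1}^D (\omega_j - \omega_j')^2 \le \gamma^2 D/n. \tag{A.12}$$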

Since $D \ge 8$ it follows from the Varshamov-Gilbert bound (see [22] or [41], Chapter 2) that there exists a subset $\mathcal{U}_0$ of $\mathcal{U}$ such that $\operatorname{card}(\mathcal{U}_0) \ge 2^{D/8}$ and

$$\|g_1 - g_2\|_n^2 \ge C_1\gamma^2 D/n \tag{A.13}$$

for any distinct $g_1, g_2 \in \mathcal{U}_0$. Using (A.10) and (A.12) we get, for any $g_1, g_2 \in \mathcal{U}_0$,

$$K(P_{g_1}, P_{g_2}) \le C_2\gamma^2 D \le C_3\gamma^2\log(\operatorname{card}(\mathcal{U}_0)),$$

and choosing $\gamma$ small enough, we can finish the proof by applying Lemma A.1 where we take $\mathcal{C} = \mathcal{U}_0$ and act in the same way as above for $M \ge 6D$.

Finally, if $D \le M < 6D$ and $2 \le D < 8$, we have $\psi_{n,M} \le 8/n$, and the proof is easily obtained by choosing $f_1 \equiv 0$ and $f_2 \equiv \gamma n^{-1/2}$ and applying Lemma A.1 to the set $\mathcal{C} = \{f_1, f_2\}$.

The (MS) aggregation case. We use the proof for (S) aggregation with $D = 1$. Note that $\tilde\Lambda^{M,1}$ is the set of all the vertices of $\Lambda^M$, except the vertex $(0, \dots, 0)$. Thus, the proof for the (S) case with $M \ge 6D$ and $D = 1$ gives us the required lower bound for the (MS) case, with the optimal rate $\psi_{n,M} = (\log M)/n$. It remains to treat (MS) aggregation for $M < 6$. Then we have $\psi_{n,M} \le (\log 7)/n$, and we apply Lemma A.1 to the set $\mathcal{C} = \{f_{\lambda'}, f_{\lambda''}\}$ where $\lambda' = (1, 0, \dots, 0) \in \Lambda^M$, $\lambda'' = (0, \dots, 0, 1) \in \Lambda^M$ and $f_\lambda$ is defined in the proof for the (S) case. Clearly, $\|f_{\lambda'} - f_{\lambda''}\|_n^2 = 2\gamma^2\log(M + 1)/n \ge 2\gamma^2(\log 3)/n$, and the result easily follows from (A.10) and Lemma A.1.

The (C) aggregation case. Consider the orthonormal trigonometric basis in $L_2[0,1]$ defined by $\phi_1(x) \equiv 1$, $\phi_{2k}(x) = \sqrt{2}\cos(2\pi kx)$, $\phi_{2k+1}(x) = \sqrt{2}\sin(2\pi kx)$, $k = 1, 2, \dots$, for $x \in [0,1]$. Set

$$f_j(x) = \gamma\sum_{k=1}^n \phi_j(k/n) I_{\{x = X_k\}}, \qquad j = 1, \dots, M, \tag{A.14}$$

where $\gamma \le L/\sqrt{2}$ is a positive constant to be chosen. The system of functions $\{\phi_j\}_{j=1,\dots,M}$ is orthonormal w.r.t. the discrete measure that assigns mass $1/n$ to each of the points $k/n$, $k = 1, \dots, n$:

$$\frac{1}{n}\sum_{k=1}^n \phi_j(k/n)\phi_l(k/n) = \delta_{jl}, \qquad j, l = 1, \dots, n,$$

where $\delta_{jl}$ is the Kronecker delta (see, e.g., [41], Lemma 1.9). Hence

$$\langle f_j, f_l\rangle_n = \gamma^2\delta_{jl}, \qquad j, l = 1, \dots, M, \tag{A.15}$$

where $\langle\cdot,\cdot\rangle_n$ stands for the scalar product induced by $\|\cdot\|_n$.
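The discrete orthonormality behind (A.15) is easy to verify numerically. The sketch below builds the Gram matrix of $\phi_1, \dots, \phi_M$ on the grid $k/n$; it uses an odd $n$ with $M \le n$, a choice made here so that no frequency reaches $n/2$, and the function names are hypothetical.

```python
import math

def phi(j: int, x: float) -> float:
    """Trigonometric basis: phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x),
    phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return 1.0
    k = j // 2  # frequency index for both phi_{2k} and phi_{2k+1}
    if j % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * x)

def discrete_gram(M: int, n: int):
    """Gram matrix (1/n) sum_k phi_j(k/n) phi_l(k/n) for j, l = 1..M."""
    grid = [k / n for k in range(1, n + 1)]
    return [
        [sum(phi(j, x) * phi(l, x) for x in grid) / n for l in range(1, M + 1)]
        for j in range(1, M + 1)
    ]

def max_deviation_from_identity(G):
    """Largest absolute entry of G - I."""
    return max(
        abs(G[j][l] - (1.0 if j == l else 0.0))
        for j in range(len(G))
        for l in range(len(G))
    )

if __name__ == "__main__":
    # Deviation from the identity is at the level of rounding error.
    print(max_deviation_from_identity(discrete_gram(M=11, n=11)))
```

Multiplying each $\phi_j$ by $\gamma$ scales the Gram matrix by $\gamma^2$, which is the statement $\langle f_j, f_l\rangle_n = \gamma^2\delta_{jl}$.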

Assume first that $M > \sqrt{n}$ (i.e., we are in the "sparse" case). Define an integer

$$m = \Bigl\lceil c_2\Bigl(n \Big/ \log\Bigl(\frac{M}{\sqrt{n}} + 1\Bigr)\Bigr)^{1/2} \Bigr\rceil$$

for a constant $c_2 > 0$ chosen in such a way that $M \ge 6m$. Consider the finite set $\mathcal{C} \subset \Lambda^M$ composed of such convex combinations of $f_1, \dots, f_M$ that $m$ of the coefficients $\lambda_j$ are equal to $1/m$ and the remaining $M - m$ coefficients are zero. In view of (A.15), for every pair of functions $g_1, g_2 \in \mathcal{C}$ we have

$$\|g_1 - g_2\|_n^2 \le 2\gamma^2/m. \tag{A.16}$$

To finish the proof for $M > \sqrt{n}$ it suffices now to apply line-by-line the argument after the formula (10) in [40], replacing there $\|\cdot\|$ by $\|\cdot\|_n$. Similarly, the proof for $M \le \sqrt{n}$ is analogous to that given in [40], with the only difference that the functions $f_j$ should be chosen as in (A.14) and $\|\cdot\|$ should be replaced by $\|\cdot\|_n$.

APPENDIX B: TECHNICAL LEMMAS 

Lemma B.1. Let $f, f_1, \dots, f_M \in \mathcal{F}_0$ and $1 \le m \le M$. Let $\mathcal{C}$ be the finite set of functions defined in the proof of (3.5). Then (A.1) holds:

$$\min_{g\in\mathcal{C}}\|g - f\|_n^2 \le \min_{\lambda\in\Lambda^M}\|f_\lambda - f\|_n^2 + \frac{L^2}{m}. \tag{B.1}$$

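Proof. Let $f^*$ be a minimizer of $\|f_\lambda - f\|_n^2$ over $\lambda \in \Lambda^M$. Clearly, $f^*$ is of the form

$$f^* = \sum_{j=1}^M p_j f_j \quad\text{with } p_j \ge 0 \text{ and } \sum_{j=1}^M p_j \le 1.$$

Define a probability distribution on $j = 0, 1, \dots, M$ by

$$\pi_j = \begin{cases} p_j, & \text{if } j \ne 0, \\ 1 - \sum_{j=1}^M p_j, & \text{if } j = 0. \end{cases}$$

Consider $m$ i.i.d. random integers $j_1, \dots, j_m$ where each $j_k$ is distributed according to $\{\pi_j\}$ on $\{0, 1, \dots, M\}$. Introduce the random function

$$\bar f_m = \frac{1}{m}\sum_{k=1}^m g_{j_k} \quad\text{where } g_j = \begin{cases} f_j, & \text{if } j \ne 0, \\ 0, & \text{if } j = 0. \end{cases}$$

For every $x \in \mathcal{X}$ the random variables $g_{j_1}(x), \dots, g_{j_m}(x)$ are i.i.d. with $E(g_{j_k}(x)) = f^*(x)$. Thus,

$$E\bigl(\bar f_m(x) - f^*(x)\bigr)^2 = E\Bigl[\Bigl(\frac{1}{m}\sum_{k=1}^m \{g_{j_k}(x) - E(g_{j_k}(x))\}\Bigr)^2\Bigr] \le \frac{1}{m}E\bigl(g_{j_1}^2(x)\bigr) \le \frac{L^2}{m}.$$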



Hence for every $x \in \mathcal{X}$ and every $f \in \mathcal{F}_0$ we get

$$E\bigl(\bar f_m(x) - f(x)\bigr)^2 = E\bigl(\bar f_m(x) - f^*(x)\bigr)^2 + \bigl(f^*(x) - f(x)\bigr)^2 \le \frac{L^2}{m} + \bigl(f^*(x) - f(x)\bigr)^2. \tag{B.2}$$

Integrating (B.2) over the empirical probability measure that puts mass $1/n$ at each $X_i$ and recalling the definition of $f^*$, we obtain

$$E\|\bar f_m - f\|_n^2 \le \min_{\lambda\in\Lambda^M}\|f_\lambda - f\|_n^2 + \frac{L^2}{m}. \tag{B.3}$$

Finally, note that the random function $\bar f_m$ takes its values in $\mathcal{C}$, which implies that

$$E\|\bar f_m - f\|_n^2 \ge \min_{g\in\mathcal{C}}\|g - f\|_n^2.$$

This and (B.3) prove (B.1). □
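The sampling argument in this proof can be checked exactly on a small example by enumerating all index tuples $(j_1, \dots, j_m)$. The sketch below is illustrative: the functions are represented by their values on the design points, `p` holds the weights $p_1, \dots, p_M$, and the name `maurey_expected_error` is hypothetical.

```python
import itertools

def maurey_expected_error(fs, p, m):
    """Exact E||f_bar_m - f*||_n^2, enumerating all tuples (j_1, ..., j_m).

    fs: list of M functions, each a list of values on the n design points.
    p:  weights p_1..p_M with p_j >= 0 and sum(p) <= 1.
    """
    M, n = len(fs), len(fs[0])
    pi = [1.0 - sum(p)] + list(p)             # pi_0, pi_1, ..., pi_M
    g = [[0.0] * n] + [list(f) for f in fs]   # g_0 = 0 and g_j = f_j
    f_star = [sum(p[j] * fs[j][i] for j in range(M)) for i in range(n)]
    total = 0.0
    for idx in itertools.product(range(M + 1), repeat=m):
        prob = 1.0
        for j in idx:
            prob *= pi[j]                      # independent draws
        f_bar = [sum(g[j][i] for j in idx) / m for i in range(n)]
        total += prob * sum((a - b) ** 2 for a, b in zip(f_bar, f_star)) / n
    return total

if __name__ == "__main__":
    # Three functions bounded by L = 1 on n = 4 design points.
    fs = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, 1.0, -1.0], [-1.0, -1.0, 1.0, 1.0]]
    p = [0.2, 0.3, 0.1]
    for m in (1, 2, 3):
        # The expected error stays below L^2 / m.
        print(m, maurey_expected_error(fs, p, m), 1.0 / m)
```

The enumeration costs $(M+1)^m$ evaluations, so it is only meant for toy sizes; its value is that it confirms the $L^2/m$ variance bound without simulation noise.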

Lemma B.2. Let $Z_d$ denote a random variable having the $\chi^2$ distribution with $d$ degrees of freedom. Then for all $x > 0$,

$$P\bigl\{Z_d - d \ge x\sqrt{2d}\bigr\} \le \exp\Bigl(-\frac{x^2}{2(1 + x\sqrt{2/d})}\Bigr). \tag{B.4}$$

Proof. See [16], page 857. □